Abstract

Boltzmann Machines learn to model the structure of an environment by modifying internal
weights. The algorithm used for changing a weight depends on collecting statistics about the
behavior of the two units that the weight connects. The success and speed of the algorithm
depend on the accuracy of the statistics, the size of the weight changes, and the way in which the
accuracy of the machine's model varies as the weights are changed. This paper presents theoretical
analysis and empirical results that can be used to select more effective parameters for the learning
algorithm.
1. Introduction

This paper assumes familiarity with Boltzmann Machines and the learning algorithm described in (Hinton et
al. 1984). As discussed there, the learning parameters affect the accuracy of the estimate of the gradient of the
cost function, G, at a point, and how that estimate is used to select the next point to be investigated.

The first section of this report discusses G in general terms and presents some intuitions about its topography.
This is followed by two sections devoted to empirical analysis of the results of varying parameters of the
learning algorithm in an effort to speed up the process. Sections 5 and 6 discuss specific problems
encountered, and the conclusion speculates on modifications to the cost function which may further improve
the learning.

2. General Results about G-space

2.1. Description of G-space

The Boltzmann Machine learns to make its model of its environment correspond to the actual environment in
which it is placed. An environment is a probability distribution of patterns over a subset of the units called
the visible units. The environment can be specified explicitly as a list of ordered pairs, each giving a pattern, V_α,
over the visible units and a probability for the occurrence of the pattern.^1 The machine's model is just
the probability distribution it would produce over the visible units if it were allowed to run freely without any
environmental input. This probability distribution is not stored directly. Instead, it is specified implicitly by
the magnitude of the weights in the machine. For a machine architecture with v visible units, there are 2^v
possible patterns. A machine whose weights were all zero would implicitly specify an environment where
each of these patterns had probability 2^-v, since the energy of all states would be the same, namely 0. In all
the environments we have investigated, the number of states that occur is very small, on the order of v rather
than 2^v.

Sampling the probability distribution implicit in the machine's connections is accomplished by simulated
annealing in Energy Space. The points in this space correspond to the 2^(v+h) discrete states of the v + h units
(visible plus hidden); the value of E at each point is determined by the weights, which remain fixed during
the annealing process.

The learning procedure attempts to change the weights to minimize the distance between the two probability
distributions, as measured by the function

$$ G = \sum_{\alpha} P(V_\alpha) \ln \frac{P(V_\alpha)}{P'(V_\alpha)} $$

where P(V_α) is the environmental probability for state V_α and P'(V_α) is the probability derived from the
machine's model. G is a function of the W weights, and lies on a W dimensional surface within the W+1
dimensional space we call G-space. Optimization is done by finding the global minimum (or a good local
minimum) over this surface.

^1 See section 4.1 for an example of an environment.
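For a machine small enough to enumerate, G can be computed exactly rather than sampled. The following is a minimal sketch under stated assumptions: units take the values 0 and 1, the energy has the usual pairwise form with optional biases, and the function names (`energy`, `model_distribution`, `G`) and the three-unit example are illustrative rather than taken from the paper.

```python
import itertools
import math

def energy(state, weights, biases):
    """E = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i for binary (0/1) units."""
    e = -sum(b * s for b, s in zip(biases, state))
    for (i, j), w in weights.items():
        e -= w * state[i] * state[j]
    return e

def model_distribution(weights, biases, n_units, n_visible, T=1.0):
    """Free-running distribution P'(V) over the first n_visible units."""
    unnorm = {}
    for state in itertools.product((0, 1), repeat=n_units):
        v = state[:n_visible]
        unnorm[v] = unnorm.get(v, 0.0) + math.exp(-energy(state, weights, biases) / T)
    z = sum(unnorm.values())
    return {v: p / z for v, p in unnorm.items()}

def G(env, model):
    """G = sum_a P(V_a) ln(P(V_a) / P'(V_a)); patterns with P(V_a) = 0 contribute 0."""
    return sum(p * math.log(p / model[v]) for v, p in env.items() if p > 0)

# Environment: two visible units that are always both on or both off.
env = {(1, 1): 0.5, (0, 0): 0.5}
# All-zero weights give a uniform model over the 2^v visible patterns, so G = ln 2.
zero = model_distribution({}, [0.0, 0.0, 0.0], n_units=3, n_visible=2)
print(G(env, zero))
```

Enumeration costs 2^(v+h) energy evaluations, so this is only useful as a check on very small machines; larger ones must rely on annealing to sample the distribution.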

In many ways optimization in this continuous space is like the optimization in the discrete energy space. It is
important not to confuse the two; one iteration of the optimization process for G requires many complete
optimizations in E-space.


2.2. Analytical Results

G is a rather nice function, and several important results can be found analytically. The principal one used by
the learning procedure,

$$ \frac{\partial G}{\partial w_{ij}} = -\frac{1}{T} \left( p_{ij} - p'_{ij} \right) \qquad (1) $$

was given in (Hinton et al. 1984). p_ij is the probability that units i and j are both on when the visible units are
clamped into environmentally specified states, and p'_ij is the corresponding probability when the visible units
are not clamped. We sample these probabilities by running the machine in two phases. When it is running
under the influence of the environment, we metaphorically say it is in the wake phase; when free running it is
in the sleep phase. By sampling these two probabilities we can estimate the gradient of G, and use this as the
basis for optimization. T is only a scale factor for the weights; doubling T and doubling all the weights will
not affect the behavior of the machine, nor will it affect the learning provided that increments in the weights
are also scaled appropriately. For simplicity, it will be assumed throughout this paper that the annealing
searches in E-space use a final temperature of 1.0.
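Equation (1) can be checked numerically on a machine small enough to enumerate. The sketch below computes the exact wake and sleep cooccurrence probabilities and compares the resulting gradient with a central finite difference of G. The three-unit machine, the particular weights, and all function names are assumptions made for illustration; biases are omitted for brevity.

```python
import itertools
import math

def energy(state, w):
    """E = -sum w_ij s_i s_j over the connected pairs (i, j)."""
    return -sum(wij * state[i] * state[j] for (i, j), wij in w.items())

def boltzmann(states, w, T):
    """Normalized Boltzmann probabilities over the given list of states."""
    unnorm = [math.exp(-energy(s, w) / T) for s in states]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def cooccurrence(w, env, n_units, n_visible, T=1.0):
    """Return (p, p_prime): exact clamped and free-running cooccurrence probabilities."""
    all_states = list(itertools.product((0, 1), repeat=n_units))
    p = dict.fromkeys(w, 0.0)
    p_prime = dict.fromkeys(w, 0.0)
    for s, prob in zip(all_states, boltzmann(all_states, w, T)):   # sleep phase
        for (i, j) in w:
            p_prime[(i, j)] += prob * s[i] * s[j]
    for v, pv in env.items():                                      # wake phase
        clamped = [s for s in all_states if s[:n_visible] == v]
        for s, prob in zip(clamped, boltzmann(clamped, w, T)):
            for (i, j) in w:
                p[(i, j)] += pv * prob * s[i] * s[j]
    return p, p_prime

def G(w, env, n_units, n_visible, T=1.0):
    """Exact G, obtained by summing the free-running distribution over hidden units."""
    all_states = list(itertools.product((0, 1), repeat=n_units))
    model = {}
    for s, prob in zip(all_states, boltzmann(all_states, w, T)):
        model[s[:n_visible]] = model.get(s[:n_visible], 0.0) + prob
    return sum(pv * math.log(pv / model[v]) for v, pv in env.items() if pv > 0)

env = {(1, 1): 0.5, (0, 0): 0.5}
w = {(0, 1): 0.3, (0, 2): -0.7, (1, 2): 0.4}
p, p_prime = cooccurrence(w, env, n_units=3, n_visible=2)
analytic = {k: -(p[k] - p_prime[k]) for k in w}      # equation (1) with T = 1

eps = 1e-6
numeric = {}
for k in w:
    up = dict(w); up[k] += eps
    dn = dict(w); dn[k] -= eps
    numeric[k] = (G(up, env, 3, 2) - G(dn, env, 3, 2)) / (2 * eps)
```

With exact statistics the two gradients agree to within the finite difference error, which separates mistakes in the update rule from sampling noise.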


The following result about the smoothness of G gives us a guarantee that we can change the weights by a fixed
proportion of the gradient and continue to descend.^2

$$ \left| \frac{d^2 G}{ds^2} \right| \le \frac{W}{2} \qquad (2) $$


where W is the number of weights, and s is any unit vector in G-space. To accomplish steepest descent, let s
point away from the gradient vector. Then in the worst case, the slope dG/ds will increase toward zero by W/2 for every
unit distance moved in direction s. Thus, we can always go a distance of 2|∇G|/W while continuing to decrease
G, and twice this distance without ending up worse than where we started. In practice, this is an extremely
conservative bound. When deciding how far to move in G-space, one must also take into account the
reliability of the estimates of p_ij and p'_ij; good estimates require longer sampling times. It has been found that
moving a distance of |∇G_est| works well for the problems we have studied.^3

^2 See appendix I for the derivation.
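The two distances quoted above follow from the curvature bound by a short worked argument. The block below is a sketch of that reasoning, assuming only that the second derivative along the descent direction never exceeds W/2; the symbols g (the gradient magnitude) and x (the distance moved) are introduced here for illustration.

```latex
% Let g = |\nabla G| and let x be the distance moved along s = -\nabla G / |\nabla G|.
% The slope starts at -g and can rise by at most W/2 per unit distance:
\frac{dG}{dx}(x) \;\le\; -g + \frac{W}{2}\,x
\quad\Longrightarrow\quad
\frac{dG}{dx}(x) < 0 \ \text{ for } \ x < \frac{2g}{W}.
% Integrating the same bound gives the worst-case change in G:
G(x) - G(0) \;\le\; -g\,x + \frac{W}{4}\,x^{2}
\quad\Longrightarrow\quad
G(x) \le G(0) \ \text{ for } \ x \le \frac{4g}{W}.
```

The first line gives the distance over which descent is guaranteed to continue, and the second gives twice that distance as the point beyond which the move could end up worse than where it started.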

Here are two more interesting theorems about the topography of G:

• If all weights but one are fixed, there is exactly one value for the remaining weight which is a local
minimum of G.

o Corollary: There are no local maxima in G-space (though there may be several local
minima).

• With no hidden units, G is concave upward in all directions.

The proof of the first theorem involves showing that the curvature of G along an axis, ij, is the variance of the
cooccurrence function, δ_ij. That of the second involves showing that the curvature in an arbitrary direction is a
quadratic form involving the direction vector and the covariance matrix of δ.

2.3. What G Looks Like

For most environments that we have studied there are a few equally probable desired states, and many
improbable states. This environmental structure and the previous theorems lead to the following mental
image of G: In polar coordinates, G is very much dependent on the angles, but not so much on the distance
from the origin. More specifically, for any point in G-space, G will be more or less monotonic on the path
from the origin to the point, and on toward infinity. Basically the idea is this: For a given set of weights, if we
increase them all proportionally, they will increase the probability difference between high and low
probability states. If they lowered G with respect to its value at the origin, increasing them will further
decrease G, and vice versa.

To be more precise, it can be shown that

$$ \frac{\partial G}{\partial r} = \frac{\langle E \rangle - \langle E' \rangle}{r} $$

where r is the polar coordinate radius, <E> is the average energy when the environment clamps the visible
units, and <E'> is the average energy when the machine is free running.

Thus claiming that scaling the weights will enable them to do whatever they were doing more effectively
entails claiming that <E> - <E'> doesn't change sign with the scaling. Empirical investigation of G for a very
simple problem corroborates this conclusion (see figure 2-1).
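One way to see where the radial result comes from is to apply the chain rule to equation (1). The derivation below is a sketch under stated assumptions: the energy has the usual form E = -Σ w_ij s_i s_j, the weight vector is written as w_ij = r u_ij with u a fixed unit vector, and T = 1 as assumed throughout the paper.

```latex
\frac{\partial G}{\partial r}
  = \sum_{ij} \frac{\partial G}{\partial w_{ij}}\, u_{ij}
  = -\sum_{ij} \left( p_{ij} - p'_{ij} \right) \frac{w_{ij}}{r}
  = \frac{1}{r} \left[ \Bigl( -\sum_{ij} w_{ij}\, p_{ij} \Bigr)
                     - \Bigl( -\sum_{ij} w_{ij}\, p'_{ij} \Bigr) \right]
  = \frac{\langle E \rangle - \langle E' \rangle}{r}
```

Here the clamped average energy is -Σ w_ij p_ij and the free-running average energy is -Σ w_ij p'_ij, since p_ij and p'_ij are the expected values of s_i s_j in the two phases.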

When examining a two dimensional cross section of G, it is important not to overgeneralize. In figure 2-1 it
appears that the solution lies in the ravine sloping down to the right. In fact, one should only conclude that
the best trade-off between the independent vectors X and Y in G-space, given that everything else must
remain fixed, is to have Y small and negative and X large and positive. It is quite possible that by changing
some other vector, a better ravine could be found.

Figure 2-1: Two dimensional cross section of G for the 1-1-1-1 Encoder problem.^4

^3 Typically |∇G_est| > |∇G|; see section 3.1.

Getting out of local minima presumably involves getting out of the wrong ravine and into the right one.
Obviously this is much easier if the errant weights are small, since the ravines seem to converge at the origin.
Since the ridges between ravines rise monotonically, for moderate sized weights it is almost certain that the
machine will be in one of the ravines.^5 The differences in the heights of the floors of the ravines are small
enough, however, that it is difficult to select the right one; generally there are more wrong than right choices.

^4 The 1-1-1-1 Encoder is described in section 4.1.

^5 The amount of noise in the search determines how close to the bottom of a ravine the machine generally stays.

3. Search Methods

3.1. Steepest Descent with Noise

In section 2, optimization was discussed as if the goal was to descend in G as fast as possible. Since G is not, in
general, concave upward, and globally rather than merely locally optimal combinations of weights are desired,
occasional uphill steps in G must be allowed, just as occasional uphill steps in E were allowed. It is less
straightforward to define a temperature for searching G and to use it for annealing than it is for E. Currently
the possibility of going uphill is provided in two ways. First, there is inevitably some noise introduced when
estimating the gradient of G by sampling cooccurrence rates, so G may actually increase in the estimated
direction of steepest descent. Second, we move farther in G than is guaranteed by the smoothness results. As
will be shown, the estimates we have been using are poor, and lead to gross overestimates of the magnitude of
the gradient; the fact that we use a step size that would exceed the smoothness guarantee even with perfect
estimates is largely irrelevant.

Many of the various search techniques we are experimenting with can be viewed as attempts to modify the
characteristics of the noise so as to facilitate rapid descent in G, while preserving the ability to escape from
merely local optima. Before analyzing how we can improve our noise, let us examine what it looks like to
begin with.

The following assumptions lead to a simple expression for the variance of the magnitude of the estimated
gradient:

1. The number of weights is large

2. The samples are independent

3. All units independently come on half the time

The derivation in appendix II gives the following result:

$$ \sigma^2\left( |\nabla G_{est}| \right) = \frac{2W}{N^2} $$

where W is the number of weights, and N is the number of samples. For a simple knowledge representation
problem, I used W = 271, N = 7. A typical value for the magnitude of the gradient was probably one or two.
Thus the variance, 11, was much larger than the actual value.

Contrary to assumption 2 above, the samples used to estimate p_ij and p'_ij are highly correlated, since sequential
samples of the global state differ by at most one unit. To combat this, we usually anneal several times in each
sleep or wake cycle, producing several independent sequences of samples.
3.2. Moving a Constant Distance in G-space

One way to avoid making very large moves when the estimated gradient is very large is to always move a
constant distance in each learning cycle. The large variance in the estimate of the magnitude of ∇G suggests
that this may perform better than the previous method. If minima tend to have diameters of order 10 (with
T = 1), moving a distance of 1 every time may be a good compromise between the possibility of not finding a
minimum when close to it and the possibility of remaining too long in a sub-optimal minimum. In contrast,
the proportional technique can produce the wrong sort of feedback: when at the bottom of a minimum, the
gradient will be small, and little searching will be done. When far from a minimum, the gradient will be large,
leading to overshoot. In fact, unbounded oscillations were the reason for originally trying the constant
distance technique.

The technique requires non-local information, because the change to any weight is dependent both on its p_ij
and p'_ij, and on the magnitude of the gradient estimate, which depends on all of the p_ij and p'_ij. Further,
results in the next section show that this performs significantly worse than the proportional technique for a
very small problem.

3.3. Changing Each Weight by a Fixed Increment

A related technique is to increment or decrement each weight by a fixed amount,^6 depending only on the sign
of p_ij - p'_ij. This will also result in moving a constant distance in G on each step; however, it has the advantage
that all weights are modified every learning cycle and it does not require non-local information. With the
previous technique, which only moves in the direction of the gradient, the machine could spend all its effort
sloshing back and forth up the sides of a ravine. With the fixed increment method, it will also make progress
along the ravine.

The fixed increment technique also has the advantage that it is compatible with two valued (or small integer
valued) weights, for which simpler, and hence faster, simulation techniques suffice. In addition, it forces the
machine to pick a representation that does not require very precise coordination between the numerical
values of weights. An example of this undesirable behavior is where a positive and a negative weight must
differ in magnitude by a precise amount to achieve the desired behavior. Such combinations of values are
hard to learn and often lead to suicidal behavior^7 of the units involved.

Note that the fixed increment method is not a steepest descent technique.
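The three weight update rules discussed so far (proportional, constant distance, and fixed increment) can be set side by side in code. This is a minimal sketch, assuming the weights and the sampled cooccurrence statistics are held in flat lists; the function names and the `rate`, `distance`, and `increment` parameters are hypothetical.

```python
import math

def gradient_estimate(p, p_prime, T=1.0):
    """Estimated dG/dw_ij = -(p_ij - p'_ij) / T for each weight."""
    return [-(a - b) / T for a, b in zip(p, p_prime)]

def proportional_step(w, grad, rate):
    """Steepest descent: move against the gradient, in proportion to its size."""
    return [wi - rate * gi for wi, gi in zip(w, grad)]

def constant_distance_step(w, grad, distance):
    """Move a fixed Euclidean distance against the gradient (needs the global norm)."""
    norm = math.sqrt(sum(gi * gi for gi in grad))
    if norm == 0.0:
        return list(w)
    return [wi - distance * gi / norm for wi, gi in zip(w, grad)]

def fixed_increment_step(w, grad, increment):
    """Change each weight by +/- increment according to the sign of p_ij - p'_ij."""
    return [wi - math.copysign(increment, gi) if gi != 0.0 else wi
            for wi, gi in zip(w, grad)]

w = [0.0, 0.0, 0.0, 0.0]
p = [0.30, 0.20, 0.25, 0.40]         # wake phase cooccurrence estimates
p_prime = [0.25, 0.25, 0.20, 0.10]   # sleep phase cooccurrence estimates
g = gradient_estimate(p, p_prime)
moved = constant_distance_step(w, g, distance=0.5)
stepped = fixed_increment_step(w, g, increment=0.1)
```

Only `constant_distance_step` needs the norm of the whole gradient, which is the non-local information mentioned in section 3.2. `fixed_increment_step` looks at one weight's statistics at a time, and moves a total distance of `increment` times the square root of the number of weights changed.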


^6 Hinton et al. 1984

^7 See section 5.1.1
3.4. Ravine Search

A class of non-steepest descent techniques are ravine search methods.^8 The existence of ravines is apparent
from figure 2-1. Figure 3-1 shows the path of the machine in G-space as it descends a ravine (the learning
algorithm was modified so that motion was restricted to this cross section of the particular 7 dimensional
G-space of the 1-1-1-1 Encoder in order to produce this figure). It is apparent that there is more side-to-side
movement than movement downhill. This sloshing results because the sides of a ravine are much steeper than
the bottom, and the "sideways gradient" dominates the desired gradient along the ravine except at the very
bottom.

3.5. Temporal Filtering: Giving the Weights Momentum

One of the ravine search techniques involves low pass filtering in time by averaging the current estimates for p_ij
and p'_ij with previous estimates, some of which were taken on one side of the ravine, some on the other.
Thus the sideways gradient will tend to cancel out, lessening the sloshing.

In addition to the improvements in ravines, where successive learning cycles have very different statistics due
to sloshing, it also improves learning of weights whose cooccurrences change little between successive cycles. In
these cases, the effect is similar to increasing the sampling time within cycles, and reduces the variance.
Excessive variance is the cause of two serious problems: getting lost (see section 4.2.4), and suicide (see section
6).
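The cancellation of the sideways gradient can be illustrated with a simple running average. The sketch below uses an exponentially weighted average, which is one possible low pass filter and not necessarily the one used in the experiments; the `decay` parameter and the two synthetic signals are assumptions chosen for illustration.

```python
def filtered_statistics(estimates, decay=0.9):
    """Exponentially weighted running average of per-cycle (p_ij - p'_ij) estimates."""
    avg = 0.0
    history = []
    for d in estimates:
        avg = decay * avg + (1.0 - decay) * d
        history.append(avg)
    return history

# A weight pointing across the ravine sees a large signal that alternates in sign
# as the machine sloshes; a weight pointing along it sees a small steady signal.
sideways = [0.4 if k % 2 == 0 else -0.4 for k in range(50)]
along = [0.02] * 50

sideways_filtered = filtered_statistics(sideways)
along_filtered = filtered_statistics(along)
print(abs(sideways_filtered[-1]), along_filtered[-1])
```

Before filtering, the sideways signal is twenty times larger than the signal along the ravine. After filtering, the two are of comparable size, because the alternating contributions largely cancel while the steady one accumulates.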

3.6. Spatial Filtering: Smoothing G-Space

An interesting possibility is to low pass filter G before searching it. Since the sides of ravines become higher
farther from the origin in G, the ravines themselves will slope up in filtered G. This will keep the weights
small, which is a big advantage: At large radii, it is a long way from one ravine to another, and the high ridges
between ravines (caused by the large weights) also preclude getting equilibrium statistics, upon which the
whole optimization procedure is based.

Further, it is easy to simulate the filtering. Adding gaussian noise to each weight will result in behavior which
is an average of that obtained with nearby combinations of weights. If the noise is proportional to the
magnitude of the weight, it will have the following effect: if two weights must differ by a constant, the
variance of the difference, and hence G, will be lower when the magnitudes of the weights are small.
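The effect of magnitude-proportional noise can be illustrated with a stand-in cost function. In the sketch below, `f` is a hypothetical cost that is zero whenever two weights differ by exactly 1; it is not the paper's G, and the function names and the `relative_sigma` parameter are likewise assumptions.

```python
import random

def noisy_weights(w, relative_sigma, rng):
    """Perturb each weight with Gaussian noise whose std. dev. is proportional to |w_ij|."""
    return [wi + rng.gauss(0.0, relative_sigma * abs(wi)) for wi in w]

def smoothed(f, w, relative_sigma=0.1, n_samples=200, seed=0):
    """Monte Carlo estimate of f averaged over nearby weight settings."""
    rng = random.Random(seed)
    total = sum(f(noisy_weights(w, relative_sigma, rng)) for _ in range(n_samples))
    return total / n_samples

def f(w):
    """Hypothetical cost: zero whenever the two weights differ by exactly 1."""
    return (w[0] - w[1] - 1.0) ** 2

small = smoothed(f, [1.0, 0.0])     # small weights satisfying the constraint
large = smoothed(f, [11.0, 10.0])   # large weights satisfying the same constraint
print(small, large)
```

Both weight settings give `f` a value of zero before smoothing. After smoothing, the setting with large weights is penalized much more heavily, so a search over the filtered surface is pushed toward the origin.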


^8 Several techniques for searching topographies with ravines are discussed in (Gelfand and Tsetlin, 1966).

Figure 3-1: Path of Machine down a Ravine


3.7. Explicitly Keeping the Weights Small

We can also keep the search confined to the area near the origin by adding a term to the cost function which
penalizes large weights. For instance,

$$ G_{new} = G + \frac{h}{2} \sum_{ij} w_{ij}^2 $$

$$ \frac{\partial G_{new}}{\partial w_{ij}} = \frac{\partial G}{\partial w_{ij}} + h w_{ij} $$

To do gradient descent in this new space, we proceed as before, and in addition subtract from each weight a
fraction of its value.

Very small values of h seem to be sufficient to keep the weights small.
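The modified update amounts to an ordinary gradient step followed by a proportional shrinkage of each weight. The following is a minimal sketch, assuming the penalty is quadratic in the weights (which is consistent with the extra h w_ij term in the derivative); the function name and the `rate` parameter are hypothetical.

```python
def decay_step(w, grad_G, rate, h):
    """One gradient step on G plus a quadratic penalty: w <- w - rate * (dG/dw + h * w)."""
    return [wi - rate * (gi + h * wi) for wi, gi in zip(w, grad_G)]

# With a zero gradient, every weight simply shrinks by the factor (1 - rate * h).
out = decay_step([2.0, -4.0], [0.0, 0.0], rate=0.5, h=0.1)
print(out)
```

Because the shrinkage is proportional to each weight's current value, it acts mainly on large weights and leaves small ones nearly unchanged, which may be why small values of h suffice.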


3.8. Adding noise to the environmental input

In addition to altering the search techniques, there are alternative methods of gathering the statistics from
which the gradient of G is estimated. A technique designed to keep the weights small is to occasionally clamp
patterns during wake which do not occur in the environment (more accurately, the environment is modified
so that many or all of the patterns have a nonzero probability). This avoids the following problem: If certain
patterns never occur in the environment, they must be given infinitely higher energy than the ones that do
occur. This requires infinite weights. One must be careful to modify the environment in a way that the
Boltzmann Machine can easily model; if it has to devote some of its representation capacity to modelling the
structure of this environmental noise, the technique will be counter-productive.

3.9. Choosing Environmental Patterns

Originally, we clamped all the environmental patterns for a period proportional to their probabilities. For
environments with many patterns, however, this becomes prohibitively slow, since an annealing must be done
every time a new pattern is clamped. Randomly choosing patterns for clamping introduces a lot of noise into
the statistics, but seems unavoidable. Using temporal filtering in conjunction with random pattern selection
may reduce the variance back to an acceptable level.^10

3.10. Partial Clamping during Sleep

When we are sampling the machine's model of its environment, the visible units are not clamped. Thus p'_ij is
always estimated from random samples, and will be noisier than estimates of p_ij. To combat this, we can
encourage the correct distribution of samples by clamping a subset of the visible units during sleep. The
clamped units will then behave according to the environment, and the sleep and wake statistics of the weights
connecting them will be identical. Thus these weights will not change. However, the rest of the weights will
obtain statistics with a lower variance.

In some situations, the Boltzmann Machine will have some visible units, O, dedicated to output, and some, I,
to input. An environment then specifies a set of conditional probabilities of the form P(O|I). In this case,
there is no need for the machine to learn the structure among the input units, since they will always be
clamped. All the weights can be used for modelling the structure between the inputs and outputs. Thus sleep
clamping can produce faster learning for this special case. The appropriate G measure in this case is

$$ G = \sum_{\alpha,\beta} P(I_\alpha, O_\beta) \ln \frac{P(O_\beta | I_\alpha)}{P'(O_\beta | I_\alpha)} $$

Similar mathematics apply in this formulation and ∂G/∂w_ij is the same as before.^11

For a knowledge representation problem with 32 environmental patterns, 10 of which are clamped during
each learning cycle, the machine was not able to learn the environmental distribution due to the high noise.
Using sleep clamping, the learning proceeded smoothly.

^10 See section 3.5

^11 Hinton et al. 1984
4. Empirical Tests of the Search Methods

4.1. Problem Description

To check our analytic results and compare search methods, a very small problem was investigated. There are
four units, two visible and two hidden, and seven weights. In the environment, the two visible units are
always either both on or both off. The hidden units must encode the state of the visible units. They form a
channel through which the visible units can communicate their state. Figure 4-1 shows the layout of the
problem.


[Diagram: units labeled visible, hidden, hidden, visible]

              PATTERN                   PROBABILITY
Unit 1   Unit 2   Unit 3   Unit 4
on       x        x        on           .50
off      x        x        off          .50

Figure 4-1: Architecture and Environment for the 1-1-1-1 Encoder

This problem is the minimal case of the encoders described in (Hinton et al. 1984) since each visible group
consists of only one unit. The channel, however, consists of two groups of hidden units, neither of which can
directly perceive both visible groups. This makes it much harder for random weight changes to increase the
correlation between visible units and be reinforced. With less environmental influence, more of the
undesirable tendencies toward suicidal behavior are exhibited (see section 6). This is proving useful for
gaining a better understanding of the previously unexplained behavior on larger problems.

For this size system, it was possible to analytically determine p_ij and p'_ij, and do error analysis. Comparing
statistics derived analytically with those found by a simulated real machine turned up some subtle bugs due to
round-off error. Boltzmann Machines are extremely sensitive to small consistent errors, even when buried in
large amounts of statistical noise. This makes such errors difficult to detect except by careful analysis.

In addition we were able to investigate the adequacy of our annealing schedules by comparing the analytic
statistics, which reflect the true equilibrium distribution of states as given by the Boltzmann distribution, to
those of a simulated real machine, which only asymptotically approach equilibrium. The results of this
investigation are discussed in section 5.1.

Sets of weights which model the environment must include either three positive weights between the four
units, or one positive weight and two negative weights. The bias weights must then be adjusted so that each
unit is on about half the time (the environmental probability).

4.2. Results

Due to the large number of parameters, comparisons were done by varying only one parameter at a time. The
control values were as follows:

• Ten independent samples were taken during both sleep and wake in order to estimate p_ij and p'_ij.

• At each learning cycle, the weights were modified so as to travel a distance of .5 in G-space in the
direction of the estimated gradient.

• Each of the two patterns was shown in its correct form during wake for an equal period, and there
was no clamping during sleep.

• The annealing schedule is shown in figure 4-2.^12 After each annealing, a single sample was taken
at the final temperature, 1.0.

• The weights were each initialized to random values uniformly distributed on [-2.0, 2.0] prior to
each run.

TIME    TEMPERATURE
 4.0    2.0
 6.0    1.5
 8.0    1.2
10.0    1.0

Figure 4-2: Annealing Schedule for the 1-1-1-1 Encoder Problem

^12 In one unit of time all of the units are probed once each (on average). When probed, a unit decides whether to be on or off based on
its energy gap.
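The annealing procedure described by the schedule and its footnote can be sketched as follows. Several details are assumptions rather than facts from the paper: the TIME column is read as cumulative elapsed time, a probed unit turns on with the usual logistic probability of its energy gap divided by the temperature, only unclamped units are probed, and the weights, biases, and function names are invented for illustration.

```python
import math
import random

# (end_time, temperature) pairs from figure 4-2; reading TIME as cumulative is an assumption.
SCHEDULE = [(4.0, 2.0), (6.0, 1.5), (8.0, 1.2), (10.0, 1.0)]

def energy_gap(state, w, b, k):
    """Energy with unit k off minus energy with unit k on."""
    return b[k] + sum(w[k][j] * state[j] for j in range(len(state)) if j != k)

def anneal(w, b, clamped=None, rng=None):
    """Run the schedule once and return the final state (one sample at T = 1.0)."""
    rng = rng or random.Random()
    clamped = clamped or {}
    n = len(b)
    state = [clamped.get(k, rng.randint(0, 1)) for k in range(n)]
    free = [k for k in range(n) if k not in clamped]
    t = 0.0
    for end_time, T in SCHEDULE:
        while t < end_time:
            # One unit of time: each free unit is probed once on average.
            for _ in range(len(free)):
                k = rng.choice(free)
                p_on = 1.0 / (1.0 + math.exp(-energy_gap(state, w, b, k) / T))
                state[k] = 1 if rng.random() < p_on else 0
            t += 1.0
    return state

# Hypothetical weights for a chain of four units: visible, hidden, hidden, visible.
w = [[0.0, 2.0, 0.0, 0.0],
     [2.0, 0.0, 2.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [0.0, 0.0, 2.0, 0.0]]
b = [-1.0, -2.0, -2.0, -1.0]
wake_sample = anneal(w, b, clamped={0: 1, 3: 1}, rng=random.Random(0))
sleep_sample = anneal(w, b, rng=random.Random(1))
```

Each call yields one independent sample, so estimating p_ij and p'_ij from ten independent samples, as in the control settings, would take ten annealings per phase.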
Each of the following graphs represents ten runs of 500 learning cycles. For this problem,

$$ G = P_{off} \ln \frac{P_{off}}{P'_{off}} + P_{on} \ln \frac{P_{on}}{P'_{on}} = \ln 2 \approx .69 $$

for uncorrelated behavior between the two visible units.^13 Because G varies tremendously from one learning
cycle to the next, the data was smoothed before plotting, by averaging with nearby values.^14 The values for p_ij
and p'_ij used to calculate G were found analytically, since the variance of the estimate from the simulated
machine, using only 10 samples, is extremely high.
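The value of G for uncorrelated behavior can be reproduced directly, both for the clean environment and for environments with clamping noise (each visible unit clamped incorrectly with some probability, as described in section 4.2.3). The sketch below assumes that the noise flips the two visible units independently and that the uncorrelated model assigns probability .25 to each of the four visible patterns; the function names are illustrative.

```python
import itertools
import math

def noisy_environment(noise):
    """Distribution over the two visible units when each is flipped with probability `noise`."""
    env = {}
    for pattern in [(1, 1), (0, 0)]:              # each intended pattern has probability .5
        for flips in itertools.product((0, 1), repeat=2):
            v = tuple(s ^ f for s, f in zip(pattern, flips))
            p_flip = math.prod(noise if f else 1.0 - noise for f in flips)
            env[v] = env.get(v, 0.0) + 0.5 * p_flip
    return env

def G_uncorrelated(noise):
    """G against a model whose visible units are independent and each on half the time."""
    return sum(p * math.log(p / 0.25) for p in noisy_environment(noise).values() if p > 0)

for noise in (0.0, 0.05, 0.10):
    print(noise, round(G_uncorrelated(noise), 3))
```

With no noise this gives ln 2, about .69. With clamping noise of .05 and .10 it gives roughly .38 and .22, close to the values of .37 and .22 quoted in the footnote, which suggests that natural logarithms are the ones intended.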

Figure 4-3 shows the results obtained with the control parameter values, to which each of the other graphs
should be compared. In each graph, it is clear when the machine finds the basic structure of the environment.
The main things to note are how many successful runs there are out of ten trials, and how early the success
becomes evident.

4.2.1. Number of Samples

The number of samples, N, used to estimate p_ij and p'_ij was varied between five and twenty. As can be seen
from figure 4-4, the number of successes generally increases slightly with increase in number of samples,
while the variability in G over short periods decreases. Since time to estimate p_ij and p'_ij varies directly with N,
ten looks like a good compromise between number of successes and running time. I think ten will remain a
good value for larger problems. Since getting independent samples requires an annealing for each sample,
and hence takes longer than getting dependent samples, there is a tradeoff between having many dependent
samples or few independent samples, given a fixed sampling time. Gail Gong (personal communication) has
determined the optimal number of annealings per sample period in terms of the variances of dependent and
independent samples, but no empirical work has been done to see how accurately these variances can be
estimated.

4.2.2. Choosing patterns

The machine has a much harder time when the environmental patterns are picked randomly, albeit with the
correct probability (figure 4-5a). Unfortunately, for large problems, there are too many patterns for them all
to be shown on each learning cycle. The techniques of clamping some units during sleep and temporal
filtering^15 both tend to reduce the variance and may alleviate this problem. Figure 4-5b suggests that sleep
clamping helps when each pattern is shown every cycle. It may have even more effect in the case of random
pattern selection.

^13 It is lower when randomness is introduced in the clamping. G becomes .37 and .22 respectively for a clamping noise (explained in
section 4.2.3) of .05 and .10 when the visible units are uncorrelated.

^14 Specifically, the data was convolved with a triangle function of length seven.

^15 See section 3.5
Figure 4-3: Performance Obtained with the Control Parameters


4.2.3.  Environmental  Noise 

One  model  for  introducing  noise  in  the  environment,  in  the  quest  to  combat  unbounded  weight  growth^^,  is 
to  give  each  visible  unit  a  probability  to  be  clamped  incorrectly.  This  probability  is  termed  the  clamping 
noise. 


Boltzmann machines represent ratios of probabilities between states as energy differences between those
states. With clamping noise, the probability ratios are relatively small, and change rather slowly as we move
to patterns farther away in Hamming distance. Thus small energy gaps between adjacent patterns will suffice.
With small energy gaps, the machine moves rapidly through K-space, and it is easy to get a good sample of the
equilibrium values of $p_{ij}$ and $p'_{ij}$. In geographic terms, the bottoms of the ravines flatten out and begin to rise
again, rather than continuing to descend in the direction of ever larger weights. Thus we can argue that a
Boltzmann machine performs better with clamping noise.
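
The size of the energy gaps involved can be illustrated with a simplified calculation. Assume an environment with a single clean pattern whose visible units are each flipped independently with probability equal to the clamping noise. A pattern one step farther away in Hamming distance is then less probable by the factor (1 - noise)/noise, and at temperature T a Boltzmann distribution represents this ratio with an energy gap of T ln((1 - noise)/noise). The function below is a hypothetical helper written for this illustration only.

```python
import math

def energy_gap_per_flip(noise, temperature=1.0):
    """Energy gap that represents the probability ratio between patterns
    differing by one additional flipped unit, under the single-pattern,
    independent-flip assumption stated above."""
    return temperature * math.log((1.0 - noise) / noise)

gap_05 = energy_gap_per_flip(0.05)   # ln 19, roughly 2.9
gap_10 = energy_gap_per_flip(0.10)   # ln 9, roughly 2.2
```

Without noise the ratio is unbounded, so no finite gap suffices; any nonzero clamping noise makes the required gap finite, and more noise makes it smaller.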
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see  section 
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Figure  4-5:  Effect  of  Clamping  Strategy  on  Performance 

The results in figure 4-6(a)‘® seem to indicate a second advantage of clamping noise. As the hidden units
randomly modify their weights and form constraints between the visible units, they will initially not get them
right, quantitatively, even if the signs are correct. It seems that the clamping noise reduces the penalty for
these attempts to use the hidden units by allowing for a certain number of atypical states. It thus reduces the
tendency toward dissociation (see section 5.1.1). Figure 4-6(b) suggests that too much clamping noise obscures
the pattern to the point that it is not learned in the number of cycles given.

4.2.4.  Size  of  Weight  Step 

The size of the weight step seems to be the most sensitive parameter (figure 4-7), and is one whose
optimal value may be expected to change drastically from one problem to another. There are two sets of data
here: data when the weight step is fixed, and data when it is proportional to the gradient. The magnitude of
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Note that the y axis scales have been adjusted so that G for uncorrelated behavior is at about the same height.


Figure 4-6: Effect of Environmental Noise on Performance. (a) Clamping Noise = .05; (b) Clamping Noise = .10
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Figure 4-7: Effect of Weight Step on Performance
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the gradient is less than the square root of the number of weights. With 7 weights, $|\nabla G| < 2.6$. Usually, it
was observed to be much lower, around 1. Thus, the distances moved on runs with a fixed step of .5 (figure
4-3) and on runs with a proportional step of $7|\nabla G|$ (figure 4-7c) are about the same, on average.^®
Reasonably enough, these two did about the same overall. With proportional steps, the machine is quite
happy to stay pretty much where it is when the gradient is small. Thus the variance in G is smaller for runs
which haven’t yet got the structure of the environment than it is for the case of fixed steps. Conversely, when
it is obvious what to do, the proportional step runs show a higher variance for G. The fact that fixed steps lead
to wider searching when the gradient is small and there is nothing obvious to do seems to be an advantage for
this technique.
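
The difference between the two kinds of step can be sketched as follows. This is an illustration under stated assumptions, not the simulator's code: a fixed step is taken here to mean a move of constant Euclidean length along the estimated downhill direction, and a proportional step a move of epsilon times the estimated gradient. The function names are hypothetical.

```python
import math

def fixed_step(weights, grad, step):
    """Move a constant distance `step` against the gradient estimate."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(weights)            # no direction to move in
    return [w - step * g / norm for w, g in zip(weights, grad)]

def proportional_step(weights, grad, epsilon):
    """Move a distance proportional to the size of the gradient estimate."""
    return [w - epsilon * g for w, g in zip(weights, grad)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

start = [0.0, 0.0]
small_grad = [0.03, 0.04]               # |grad| = .05: nothing obvious to do
moved_fixed = distance(start, fixed_step(start, small_grad, 0.5))
moved_prop = distance(start, proportional_step(start, small_grad, 0.5))
```

When the gradient is small the fixed step still moves the full distance, while the proportional step barely moves; this is the wider searching referred to above.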

Unfortunately, searching widely with little guidance from the gradient is likely to lead the machine far from
the origin. In this never-never land of large weights, the machine can no longer reach equilibrium, and the
nice monotonic behavior of G, even for large weights, evidenced in figure 2-1 is for naught.^^ This behavior
can be seen in the graph for weight step = .7. If a run doesn’t luck into a good combination of weights early
on, it gets lost and never finds one. The basic result is this: In a problem where one need only go down in
G-space, it is best not to go too far, lest one’s estimate be wrong, or the gradient change too much along the
way.

5.  Reaching  Equilibrium 

5.1. Necessity of Getting Equilibrium Statistics

Equation 1 is based on the assumption that $p_{ij}$ and $p'_{ij}$ follow from Boltzmann distributed global probabilities.
If we do not anneal long enough before sampling,^^ $P_\alpha^{est}$ will be dependent on the starting state as well as the
true $P_\alpha$, resulting in erroneous $p_{ij}$ and $p'_{ij}$. An example which comes up often in simulations is the test tube
space (see figure 5-1).

The annealing process begins with the machine in a random state. It is equally likely to be within the
collecting area for minimum A or for minimum B, since we hypothesize the same number of states in each.
Assuming that annealing proceeds too fast to reach equilibrium, the machine will fall to the bottom of
whichever minimum it began in and stay there. In this case, we are measuring the proportion of states with

^19 see appendix I

^20 7 is twice the bound given by equation 2 with

^21 Section 5.1 contains more discussion of the necessity of reaching equilibrium.
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Long enough is approximately the recurrence time for a random global state.
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Figure 5-1: Energy Space which Demonstrates the Effect of Insufficient Annealing (two minima, labelled A and B)

downhill paths into each minimum, rather than the relative depths of each minimum.
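
The distinction can be made concrete with a small calculation. The sketch below makes two simplifying assumptions that are not taken from the simulations: each minimum is treated as a single state of a given depth (depth being the negative of its energy), and a quench is treated as landing in a minimum with probability proportional to its collecting area. The function names are hypothetical.

```python
import math

def equilibrium_split(depth_a, depth_b, temperature=1.0):
    """Boltzmann probabilities of the two minima; depends only on depths."""
    weight_a = math.exp(depth_a / temperature)
    weight_b = math.exp(depth_b / temperature)
    total = weight_a + weight_b
    return weight_a / total, weight_b / total

def quenched_split(area_a, area_b):
    """Probabilities after a quench; depends only on collecting areas."""
    total = area_a + area_b
    return area_a / total, area_b / total

# Equal collecting areas, but minimum A is deeper than minimum B by 2.
at_equilibrium = equilibrium_split(2.0, 0.0)   # A is strongly preferred
after_quench = quenched_split(10, 10)          # 50% in each minimum
```

With equal collecting areas the quenched statistics stay at 50% in each minimum however deep A becomes, whereas the equilibrium statistics follow the depths.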

This effect is evident in figure 5-2. The upper curve is G calculated analytically under the assumption of
equilibrium  statistics.  The  lower  curve  is  G  estimated  from  the  statistics  actually  collected  by  the  machine. 
Even  where  the  analytic  G  is  very  high,  the  sleeping  and  waking  statistics  are  nearly  identical.  The  machine 
would  be  able  to  complete  environmental  patterns  quite  well. 

One may therefore ask whether it is essential to reach equilibrium; all we really want is for the sleeping and
waking statistics to converge so we have a learning associative memory device. We can use the fact that the
collecting areas of and Hamming distances between minima are important, and learn things that the
theoretical Boltzmann Machine can’t.^^ The answer seems to be no, though problems show up only under
extreme conditions where the weights are large. Table 5-1 shows two sets of cooccurrence statistics for a
machine which is on the borderline of trouble. Due to the large weights, the machine’s statistics are very close
to the environmental ones, though the true equilibrium statistics are not.

5.1.1.  Case  Analysis 

The problem has to do with the way non-equilibrium statistics compare with equilibrium ones. Consider the
energy space for the encoder problem. There are two minima, one for each of the environmental states as in
figure 5-1. Due to the symmetry of the architecture, the areas of the minima are the same; thus as the
machine’s statistics diverge from equilibrium, they will approach 50% in each minimum, independent of the
relative depths. By happy accident, this is the same percentage in each state that occurs in the environment.
To investigate how the machine handles itself when it must actively maintain the correct ratio between
patterns, the environment was changed so that the visible units are off 60% and on 40%. Now, as the weights
increase, the machine’s statistics become closer to 50% than the equilibrium statistics. As a result, the
difference in depth of the minima increases. Eventually the space looks like figure 5-3.

^^David  Ackley,  personal  communication. 
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Connected Units     Weight     P      P'

ANALYTIC
1      (bias)        11.4     .50    .90
2      (bias)         1.6     .50    .10
3      (bias)         1.1     .50    .10
4      (bias)        10.3     .50    .90
1      2            -18.2     .00    .00
2      3             16.7     .50    .10
4      3            -18.9     .00    .00

SIMULATED
1      (bias)        11.4     .50    .50
2      (bias)         1.6     .50    .50
3      (bias)         1.1     .50    .49
4      (bias)        10.3     .50    .51
1      2            -18.2     .00    .00
2      3             16.7     .50    .49
4      3            -18.9     .00    .00

Table 5-1: Sample Sleeping and Waking Cooccurrence Statistics
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Figure 5-3: Worsening Energy Space

Now a tiny change in the height of the barrier makes a tremendous difference in the state probabilities. We
thus have a tense combination of weights: large in magnitude, and precisely coordinated in relative terms.
This in itself is not sufficient for disaster, however. Another effect of non-equilibrium prevents the weights
from correcting perturbations. Once the hidden units have reached a state where one is on and one is off,
they are unlikely to change, even if the positive weight is large (see figure 5-4). The visible unit is much more
responsive to changes in the value of the positive weight. This is because there is no barrier to be crossed
when the visible unit flips, as there is when both hidden units flip. To generalize, the probability ratio
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between two states is closer to the equilibrium ratio for states which are close in Hamming distance.

Figure 5-4: Weights Associated with Poor Energy Space (units, left to right: visible, hidden, hidden, visible; biases shown below the units)


Eventually, random sampling errors will lead to weight changes which cause the more probable state (both
off) to occur almost all the time. To correct this, the machine will lower the weights to both visible units and
modify the weights to the hidden units so as to equalize the state probabilities. The latter process, however,
takes place much more slowly than the former, and the net effect is that the visible units dissociate themselves
from the hidden units, which then have no incentive to change; they continue to always be in the same state.
This behavior is termed signal driven suicide (to distinguish it from suicide driven by random noise as
discussed in section 6). The dissociation is generally stable since during both sleep and wake the statistics will
be the same for the hidden units, either 100% or 0% on. Thus the estimated value of $\partial G / \partial w_{ij}$ will be 0.

Figure 5-5 shows the behavior of the machine for an asymmetric environment. The dotted curve shows the
value of G estimated from the machine’s statistics; the solid curve shows the analytic value of G. The behavior
is the same as in figure 5-2, until the asymmetry causes the visible units to dissociate from the hidden ones
after 1300 cycles.

5.2.  Ensuring  Equilibrium 

A promising technique for discouraging the type of behavior discussed in section 5.1 is to use two annealing
schedules. When the weights begin to get large and create a test tube like energy space, the less conservative
annealing schedule will begin to generate poor statistics before the more conservative one. The difference
between statistics is then used to coax the weights away from combinations of values that make it hard to
reach equilibrium.

Suppose  we  have  two  deep  minima,  A  and  B,  and  that  we  have  two  extreme  annealing  schedules.  One 
quenches  the  system,  going  directly  from  infinite  to  zero  temperature,  and  thus  measures  the  relative  areas  of 
the  minima.  The  other  is  slow  enough  to  reach  equilibrium,  and  thus  measures  the  relative  depth  of  the 
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minima. 

Now we can insert special learning cycles where quenching takes the place of the normal wake cycle, and no
units are clamped. After collecting statistics, we modify the weights in the normal manner:

$$\Delta w_{ij} = \epsilon \left( p'^{\,fast}_{ij} - p'^{\,slow}_{ij} \right)$$

If the depth of a minimum is too large for its collecting area, all weights between units that cooccur in that
minimum will be reduced, and vice versa for overly shallow minima. The effect is to change the depth of the
minima so that the equilibrium statistics match those obtained with the faster schedule. Since the total areas
of minima remain relatively constant with time, it prevents any minimum from getting too deep. If we use a
moderately fast annealing schedule instead of a very fast quench, the special learning cycle will have no effect
on minima until they get too deep for the fast schedule to reach equilibrium. As this point is approached, the
special learning cycle will prevent further deepening.

A reasonable implementation for these extra cycles would seem to be to alternate them with standard learning
cycles. If we are willing to use the same epsilon for both cycles, we can save time and combine the two.
Notice that the net weight change after a standard cycle followed by a special cycle is

$$\Delta w_{ij} = \epsilon \left( p_{ij} - p'_{ij} + p'^{\,fast}_{ij} - p'^{\,slow}_{ij} \right)$$

If we use two fast annealing schedules in the standard cycle, the second and third terms cancel. We can get
the effect of the two types of cycles simply by using the fast schedule during the wake phase of the standard
cycle, and the slow schedule during the sleep phase.
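
The cancellation can be checked numerically. The sketch below is illustrative only: the function names and the sample statistics are hypothetical, and a single epsilon is assumed for both kinds of cycle, as in the text.

```python
def standard_update(p_wake, p_sleep, epsilon):
    """Weight change from a standard cycle (wake minus sleep statistics)."""
    return epsilon * (p_wake - p_sleep)

def special_update(p_free_fast, p_free_slow, epsilon):
    """Weight change from a special cycle: unclamped statistics from the
    fast schedule take the place of the wake statistics."""
    return epsilon * (p_free_fast - p_free_slow)

def combined_update(p_wake_fast, p_free_slow, epsilon):
    """Single cycle: fast schedule while awake, slow schedule while asleep."""
    return epsilon * (p_wake_fast - p_free_slow)

# Hypothetical cooccurrence statistics for one weight.
p_wake_fast, p_free_fast, p_free_slow, eps = 0.40, 0.31, 0.27, 0.5

two_cycles = (standard_update(p_wake_fast, p_free_fast, eps)
              + special_update(p_free_fast, p_free_slow, eps))
one_cycle = combined_update(p_wake_fast, p_free_slow, eps)
```

The unclamped fast-schedule statistics enter the two-cycle sum once with each sign, so `two_cycles` and `one_cycle` agree up to rounding and those statistics never need to be collected.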

This  technique  is  not  robust  in  the  sense  that  if  a  very  bad  energy  space  develops,  it  cannot  recover,  because 
even  the  conservative  schedule  provides  poor  statistics.  The  encoder  problem  was  run  with  this  technique, 
and  the  weights  remained  small.  The  bound  on  the  weights  could  be  varied  by  changing  the  difference  in 
speed  between  the  two  annealing  schedules.  Similar  results  were  obtained  on  a  larger  problem  involving  37 
units  and  5S9  weights. 

6.  Suicide 

One context in which suicide occurs is when hidden units get little feedback from the environment. This
difficulty was encountered early on, but was not understood until recently. Hidden units tend to develop all
positive or all negative weights and consequently are either always on or always off.


This effect is best explained by an analogy: Nearly all the loose gravel on a busy road accumulates at the side
of the road, even if the road is flat. This is not because cars selectively push gravel toward the nearest side. It
is because a piece of gravel has a much higher probability of making a move when it is near the middle of the
road, so it spends very little time in the middle. Similarly, when a hidden unit comes on about half the time
there will be a very high variance in $p_{ij}$ and $p'_{ij}$, so $|p_{ij} - p'_{ij}|$ will be large and the weights will change a lot.
When the unit is almost always on or almost always off, there will be very little variance and the weights will
remain fixed. This effect can overpower the systematic effect due to the true value of $\partial G / \partial w_{ij}$.
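
The gravel analogy can be reproduced with a toy simulation. The sketch below is not a model of a Boltzmann Machine: it is an unbiased random walk on a road of unit width whose step size is large in the middle and small at the sides. The step-size formula and all constants are hypothetical choices made for the illustration.

```python
import random

def edge_fraction(steps=200_000, seed=0, base=0.02, gain=0.5, edge=0.1):
    """Fraction of time an unbiased random walker on [0, 1] spends within
    `edge` of either side when its step size peaks in the middle."""
    rng = random.Random(seed)
    x = 0.5                                    # start in the middle of the road
    near_edge = 0
    for _ in range(steps):
        sigma = base + gain * x * (1.0 - x)    # largest at x = .5
        x += rng.gauss(0.0, sigma)             # no push toward either side
        x = min(1.0, max(0.0, x))              # the road has sides
        if x < edge or x > 1.0 - edge:
            near_edge += 1
    return near_edge / steps

fraction = edge_fraction()
```

The two edge strips make up only 20% of the road, yet the walker spends most of its time in them, although no step is ever biased toward a side.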

Since this explanation for suicide depends on the fact that the magnitude of the change to a weight is a
function of the weight, it was thought that removing this dependence could solve the problem. This may be
accomplished by estimating the standard deviation of the estimates of $p_{ij}$ and $p'_{ij}$, and dividing by it to
determine the change to the weight. Results so far have been negative, however.

The suicide problem may be reduced by using more samples in the estimate of $p_{ij}$ and $p'_{ij}$ than were taken in
the current learning cycle as mentioned in section 3.5, as well as by the techniques discussed above for
keeping the weights small.
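
One simple way to bring samples from earlier learning cycles into the current estimate is a running average over recent cycles. The sketch below assumes a uniform window; the filter actually described in section 3.5 may differ, and the class name and window length are hypothetical.

```python
from collections import deque

class FilteredStatistic:
    """Running average of a cooccurrence estimate over the last `window`
    learning cycles (uniform weighting is an assumption of this sketch)."""

    def __init__(self, window):
        self.history = deque(maxlen=window)   # oldest cycles drop out

    def update(self, estimate):
        self.history.append(estimate)
        return sum(self.history) / len(self.history)

# A hidden unit whose per-cycle estimates are noisy.
p_filtered = FilteredStatistic(window=3)
smoothed = [p_filtered.update(p) for p in (1.0, 0.0, 0.5, 0.5)]
```

The smoothed sequence varies less from cycle to cycle than the raw estimates, at the cost of including statistics gathered with slightly older weights.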

7.  Conclusion 

7.1.  Robustness 

The  results  developed  above  seem  to  be  generally  applicable  to  Boltzmann  Machines.  The  techniques 
requiring  problem  dependent  constants,  namely  constant  distance  weight  modification  (section  3.2)  and 
explicitly  keeping  the  weights  small  (section  3.7),  were  found  to  be  unnecessary.  Work  on  a  37  unit  problem 
has indicated that the best values for parameters found above can be successfully used unchanged; these
include  the  annealing  schedules,  number  of  patterns  per  learning  cycle,  epsilon,  and  amount  of  temporal 
filtering.  Many  techniques  used  formerly  to  prevent  unbounded  weight  growth  are  rendered  unnecessary  by 
the  single  technique  of  special  learning  cycles  using  slow  and  fast  annealing  cycles  (section  5.2).  These  old 
techniques  include  explicitly  keeping  the  weights  small,  noise  in  the  environment  (section  3.8),  and  others  too 
specialized  for  treatment  here. 

7.2.  Future  Work 

The most important result about learning in Boltzmann Machines is that there exists a global measure, G, of
the discrepancy between the machine’s model of its environment and the actual environment, and that the
partial derivatives of this measure with respect to the weights are locally computable. For problems where the
environment can be modeled with pairwise constraints among the visible units, G is concave upward and
gradient descent works well.
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Upon reflection, it should not be surprising that more complicated environments that need hidden units to
model higher order constraints among the visible units are difficult to model without directly considering
non-local phenomena. Incorporating meta-knowledge about useful representations into a cost function
could perhaps eliminate the need to search many of the local minima found in G. For large problems,
hierarchical representations will be necessary. "Concepts" at each level can only be formed after those at
lower levels upon which they depend. At each level, the influence of the environment is weaker, and the
gradient of the evaluation function will be correspondingly smaller. Additional constraints to select among
possible ravines will be invaluable. Modifications to G which encourage small weights have been discussed.
Encouraging units not to duplicate the behavior of other units may also be necessary. Possibilities such as
these will become the center of investigation as the structure of G-space and the performance of techniques
for searching it become better understood.

Acknowledgements 

The  members  of  the  Boltzmann  Group  at  CMU  have  continually  guided  this  research  through  discussions 
and suggestions. I am particularly grateful to Geoff Hinton for assistance and encouragement, and to David
Ackley  for  writing  the  Boltzmann  Machine  simulator. 




Appendix^^ 

I.  Derivation  of  the  Smoothness  Result 

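The elements of the Hessian of G are given by

$$h_{kl} = \frac{\partial^2 G}{\partial w_k \, \partial w_l} = \frac{\partial}{\partial w_k}\left( p'_l - p_l \right) = \left[ m'(s_k s_l) - m'(s_k)\, m'(s_l) \right] - \sum_\alpha P_\alpha \left[ m_\alpha(s_k s_l) - m_\alpha(s_k)\, m_\alpha(s_l) \right]$$

where $m(\cdot)$ is the statistical mean (primed for the free-running machine, subscripted $\alpha$ when state $\alpha$ is clamped over the visible units); and $s_a$ is 1 if both units connected by weight $a$ are on, 0 otherwise. Each
term is restricted to [-.25, .25], so $|h_{kl}| \le .5$. Thus the maximum curvature in any direction,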

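Other Smoothness Results

$$\left| \frac{\partial G}{\partial w_{ij}} \right| = \left| p_{ij} - p'_{ij} \right| \le 1$$

since $p_{ij}$ and $p'_{ij}$ are probabilities. Thus, the slope along any axis is less than one, and the gradient must satisfy

$$|\nabla G| = \sqrt{ \left( \frac{\partial G}{\partial w_1} \right)^2 + \cdots + \left( \frac{\partial G}{\partial w_W} \right)^2 } \le \sqrt{W}$$

where $W$ is the number of weights. Further,

$$\frac{\partial^2 G}{\partial w_{ij}^2} = (1 - p'_{ij})\, p'_{ij} - \sum_\alpha P_\alpha \, (1 - p^\alpha_{ij})\, p^\alpha_{ij}$$

where $p^\alpha_{ij}$ is the conditional probability that units i and j are both on (cooccurrence probability) given that state
$\alpha$ is clamped over the visible units. Both terms are restricted to [0, .25], so

$$\left| \frac{\partial^2 G}{\partial w_{ij}^2} \right| \le \frac{1}{4}$$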
II.  Derivation  of  the  Variance  of  the  Estimated  Gradient 

Mdl  VOa-  veil)’  =  ^[(pp^-p'iO-Ij’p-P'ij)? 

Let 

Assuming  a  zero  mean  normal  distribution  for  the  difference  between  estimates  (this  is  equivalent  to 
assuming  a  large  number  of  samples). 


^^Both appendices assume a temperature of 1 in the derivations.
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Assuming we have N independent samples,

ay)=mi±pdiL^^ 

assuming all units are independently on half the time.
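
As a hedged illustration of the sampling assumption, the standard result for the mean of N independent samples of a binary event with probability p is a variance of p(1 - p)/N. If all units are independently on half the time, two units are both on with probability 1/4. The function below is a hypothetical helper for this calculation and is not claimed to reproduce the exact expression used in the derivation.

```python
def cooccurrence_estimate_variance(p, n_samples):
    """Variance of the sample mean of `n_samples` independent 0/1 samples
    of an event that occurs with probability `p`."""
    return p * (1.0 - p) / n_samples

# Units independently on half the time: both on with probability 1/4.
variance_100 = cooccurrence_estimate_variance(0.25, 100)
variance_400 = cooccurrence_estimate_variance(0.25, 400)
```

Quadrupling the number of samples divides the variance by four, so the standard deviation of the estimate is only halved.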
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